Machine learning reveals limited contribution of trans-only encoded variants to the HLA-DQ immunopeptidome

Human leukocyte antigen (HLA) class II antigen presentation is key for controlling and triggering T cell immune responses. HLA-DQ molecules, which are believed to play a major role in autoimmune diseases, are heterodimers that can be formed as both cis and trans variants depending on whether the α- and β-chains are encoded on the same (cis) or opposite (trans) chromosomes. So far, limited progress has been made for predicting HLA-DQ antigen presentation. In addition, the contribution of trans-only variants (i.e. variants not observed in the population as cis) in shaping the HLA-DQ immunopeptidome remains largely unresolved. Here, we seek to address these issues by integrating state-of-the-art immunoinformatics data mining models with large volumes of high-quality HLA-DQ specific mass spectrometry immunopeptidomics data. The analysis demonstrates highly improved predictive power and molecular coverage for models trained including these novel HLA-DQ data. More importantly, investigating the role of trans-only HLA-DQ variants reveals a limited to no contribution to the overall HLA-DQ immunopeptidome. In conclusion, this study furthers our understanding of HLA-DQ specificities and casts light on the relative role of cis versus trans-only HLA-DQ variants in the HLA class II antigen presentation space. The developed method, NetMHCIIpan-4.2, is available at https://services.healthtech.dtu.dk/services/NetMHCIIpan-4.2.

Supplementary figure 11: Sequence-based clustering of DQ molecules. The tree is based on 61 DQ molecules, including the 14 molecules described by the novel data. Orange molecules are covered by the method including the novel data with at least 100 peptides, and blue molecules are within a distance of 0.025 of an orange molecule. Black molecules are non-covered (i.e. have a peptide count below 100 and a distance greater than 0.025 to any orange molecule). Logos in black frames correspond to orange molecules. Logos in red frames correspond to molecules from branches with clusters of non-covered (black) molecules. The phylogenetic tree was constructed from the DQ pseudo-sequences using ClustalW. Logos were constructed from the top 1% of 100,000 random 13-17 mer peptides.

Each point shows the mean per-dataset peptide fraction for a given DQ molecule. Each boxplot shows the median inside the IQR between the upper and lower quartiles, with whiskers extending to at most 1.5 times the IQR. For each method, trans-only molecules are shown in one boxplot (n=7), while cis molecules are shown in three categories, namely all cis molecules (Cis -All, n=21), cis molecules found in the DQ-SA training data (Cis -SA, n=9), and cis molecules only found in the DQ-MA training data (Cis -MA, n=12 for NetMHCIIpan-4.2 and n=13 for MixMHC2pred-2.0).
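The colour coding described for Supplementary figure 11 amounts to a simple three-way rule, which can be sketched as follows. This is an illustrative sketch only: the molecule names, peptide counts and pairwise distances are hypothetical, and the distances are assumed to be supplied as precomputed values (how they are read off the ClustalW tree is not specified in the caption).

```python
# Thresholds taken from the caption of Supplementary figure 11.
COUNT_MIN = 100    # "at least 100 peptides" -> orange
DIST_MAX = 0.025   # within this distance of an orange molecule -> blue


def classify_coverage(peptide_counts, distances):
    """Label each molecule 'orange', 'blue' or 'black'.

    peptide_counts: {molecule: number of peptides}
    distances: {frozenset({mol_a, mol_b}): distance}, assumed precomputed.
    """
    orange = {m for m, n in peptide_counts.items() if n >= COUNT_MIN}
    labels = {}
    for mol in peptide_counts:
        if mol in orange:
            labels[mol] = "orange"
        elif any(distances[frozenset((mol, o))] <= DIST_MAX for o in orange):
            labels[mol] = "blue"
        else:
            labels[mol] = "black"  # non-covered
    return labels


# Hypothetical toy input (not data from the study).
counts = {"DQ_A": 250, "DQ_B": 12, "DQ_C": 0}
dist = {
    frozenset(("DQ_A", "DQ_B")): 0.010,
    frozenset(("DQ_A", "DQ_C")): 0.300,
    frozenset(("DQ_B", "DQ_C")): 0.280,
}
labels = classify_coverage(counts, dist)
print(labels)
```

In this toy input, DQ_B has too few peptides to be orange but sits close to DQ_A, so it is labelled blue, whereas DQ_C is far from every orange molecule and remains black.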

Supplementary figure 13B: Peptide-count contribution of cis and trans-only molecules predicted by NetMHCIIpan-4.2 and MixMHC2pred-2.0 on DQ-heterozygous data from Marcu et al. 2021, taking into account pseudo-sequence overlap.
Each point shows the mean per-dataset peptide fraction for a given DQ molecule. Each boxplot shows the median inside the IQR between the upper and lower quartiles, with whiskers extending to at most 1.5 times the IQR. For each method, trans-only molecules are shown in one boxplot (n=7), while cis molecules are shown in three categories, namely all cis molecules (Cis -All, n=21), cis molecules found in the DQ-SA training data or with the same pseudo-sequence as a DQ-SA molecule (Cis -SA, n=13 for NetMHCIIpan-4.2 and n=14 for MixMHC2pred-2.0), and cis molecules only found in the DQ-MA training data and with no pseudo-sequence overlap to cis-SA molecules (Cis -MA, n=8). Here, a significant difference was found between cis-MA and trans-only in NetMHCIIpan-4.2 (t=3.6, p=0.003, n=8 cis-MA molecules and n=7 trans-only molecules, two-sided t-test).
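As a rough plausibility check of the reported statistics, the two-sided p-value can be recomputed from t and the group sizes. The sketch below assumes a pooled-variance (Student's) t-test with df = 8 + 7 - 2 = 13; the caption does not state whether a pooled or Welch test was used, so this choice is an assumption. The function names are illustrative, and the p-value is obtained by numerical integration of the t density using only the Python standard library.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with pooled variance; returns (t, df)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=10_000):
    """Two-sided p-value: 1 - 2 * integral of the pdf from 0 to |t|.

    Uses Simpson's rule; `steps` must be even.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


# Values reported in the caption: t = 3.6, n = 8 (cis-MA) and n = 7 (trans-only).
# df = 13 follows from the pooled-variance assumption stated above.
p_reported = two_sided_p(3.6, 8 + 7 - 2)
print(f"p = {p_reported:.4f}")
```

Under this assumption the recomputed p-value falls between 0.002 and 0.005, which is in line with the reported p = 0.003. With the per-molecule peptide fractions at hand, `pooled_t_statistic` would give the t value directly.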
Supplementary figure 13C: DQ motif deconvolution by our method for DQ-heterozygous datasets in the benchmark data from Marcu et al. 2021. Predictions were made without peptide context encoding. Each row corresponds to a donor sample. Only peptides with percentile rank less than 10 were included in the logo plots. The number of peptides used to create each motif is shown in parentheses above the given logo. Trans-only molecules are highlighted in red frames.
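The filtering step behind these logo plots (keep peptides with percentile rank below 10, then group them by the molecule they are assigned to) can be sketched as follows. The input records are hypothetical, and the sketch assumes each prediction already carries an aligned, equal-length core sequence, which the caption does not state; the per-position counts stand in for the input a logo tool would need.

```python
from collections import Counter, defaultdict

RANK_CUTOFF = 10.0  # percentile-rank threshold from the caption


def peptides_per_motif(predictions, cutoff=RANK_CUTOFF):
    """Group cores by assigned molecule, keeping only rank < cutoff.

    predictions: iterable of (core, molecule, percentile_rank) tuples.
    """
    groups = defaultdict(list)
    for core, molecule, rank in predictions:
        if rank < cutoff:
            groups[molecule].append(core)
    return dict(groups)


def position_counts(cores):
    """Per-position amino-acid counts for equal-length cores."""
    length = len(cores[0])
    return [Counter(core[i] for core in cores) for i in range(length)]


# Hypothetical predictions (not data from the study).
predictions = [
    ("AAAKAAAAA", "mol_1", 0.5),
    ("AAAKAAAAL", "mol_1", 9.9),
    ("GGGGGGGGG", "mol_1", 10.0),  # excluded: rank is not below 10
    ("LLLLLLLLL", "mol_2", 3.0),
    ("VVVVVVVVV", "mol_2", 55.0),  # excluded
]
groups = peptides_per_motif(predictions)
motif_sizes = {mol: len(cores) for mol, cores in groups.items()}
print(motif_sizes)
```

The values in `motif_sizes` correspond to the peptide counts shown in parentheses above each logo.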

Figure (image not recovered): Peptides uniquely assigned to HLA-DQ in wo_Saghar method. Distribution of peptide annotations in w_Saghar method. Recovered labels: HLA-DR 4688, Trash 2805; HLA-DR 2629, Trash (count missing).

Supplementary table 2: Overview of peptides assigned to DQ molecules in the methods with (w_Saghar) and without (wo_Saghar) the novel data. Trash peptides with percentile rank greater than 20 are not included in the metrics. A bold value indicates either a higher average peptide count or a lower mean/median percentile rank in a given method.

Supplementary table 3: Overview of consistency analysis in the methods with (w_Saghar) and without (wo_Saghar) the novel data. The molecules are sorted in descending order by the difference in mean consistency. Further, the metrics are calculated on the peptide sets used in the consistency analysis, with the union of identified trash peptides removed. As such, the metrics regarding differences in percentile ranks may not correspond one-to-one with the values in supplementary table 4.